| | UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Target | Failure Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | No Failure |
| 1 | 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | No Failure |
| 2 | 3 | L47182 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | No Failure |
| 3 | 4 | L47183 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | No Failure |
| 4 | 5 | L47184 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | No Failure |
Uncovering Patterns and Anomalies in Manufacturing Data
INFO 523 - Final Project
Abstract
Over recent decades, industry has continued to modernize its processes and equipment, generating enormous amounts of data that in many cases remain underutilized. Vast, rich data streams from machines, such as temperatures, vibration, and pressure, are constantly monitored. Traditional statistical process control is widely used to oversee key process parameters and alert technicians when something might be behaving abnormally. In recent years, with the rapid growth of AI, industry has been exploring different approaches to monitor these parameters and apply new techniques to more effectively predict or detect defects in products or problems in the machines.
This study focuses on using machine learning algorithms such as random forest and gradient boosting to predict failures and the type of failure. Real factory data is noisy, involves interactions among many factors, and is highly imbalanced. Machines are expected to run continuously without failure, and processes ideally produce products without any defects, which skews the data heavily toward the good state with only a few failures. Tuning models to handle this imbalance is critical in factories. Different approaches for this were evaluated: over-sampling the minority (failure) class with synthetic data, under-sampling the majority class, and assigning class weights.
A second study was done on time-series data, where algorithms such as ARIMA and LSTM were used to detect outliers over time. One big challenge is that manufacturing KPIs are typically centered around a target, and the variation around it is random, ideally following a normal distribution but not necessarily any pattern, which limits the ability of these algorithms to predict future values. Their residuals can still be used to detect outliers, although this might not be the best approach.
Background
To explore machine learning algorithms for failure detection, a dataset from Kaggle was used. The data, “Machine Predictive Maintenance Classification,” is a synthetic dataset that, according to the source, reflects a real use case in industry. The dataset consists of 10,000 rows with 14 different features.
Key Features: (Source: https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification)
UID: unique identifier ranging from 1 to 10000
productID: consisting of a letter L, M, or H for low (50% of all products), medium (30%), and high (20%) as product quality variants and a variant-specific serial number
Type: a column containing only the letters L, M, and H, extracted from productID.
air temperature [K]: generated using a random walk process later normalized to a standard deviation of 2 K around 300 K
process temperature [K]: generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 K.
rotational speed [rpm]: calculated from a power of 2860 W, overlaid with normally distributed noise
torque [Nm]: torque values are normally distributed around 40 Nm with σ = 10 Nm and no negative values.
tool wear [min]: The quality variants H/M/L add 5/3/2 minutes of tool wear to the used tool in the process.
Machine failure: label indicating whether the machine failed at this particular data point, i.e., whether any of the failure modes occurred.
Table 1: Example of 5 rows of the synthetic data used for predictive modeling.
Process temperature [K] summary: mean 310.01, median 310.10, max 313.80, min 305.70, standard deviation 1.48, n = 10,000.
Figure 1: Example of the distribution observed for one synthetic parameter (Process temperature). Visually, the data do not show a significant skew and are relatively close to a normal distribution.
Rotational speed [rpm] summary: mean 1538.78, median 1503.00, max 2886.00, min 1168.00, standard deviation 179.28, n = 10,000.
Figure 2: Example of the distribution observed for one synthetic parameter (Rotational speed [rpm]). In this case the data are slightly skewed, which is sometimes expected for real machine data.
Table 1 and Figure 1 are examples of what the data look like; there are no missing data in this dataset because it was synthetically created, which is not typical of a real scenario. Figure 2 shows another parameter where the data are skewed. After reviewing all parameters, the dataset is a good representation for exploring machine learning algorithms and can support the conclusions of this analysis.
All detailed data cleaning and exploration can be found here: https://github.com/INFO-523-SU25/final-project-castro/blob/main/src/Data_Exploration_PDM.ipynb
The second dataset consists of simulated real-time sensor data from industrial machines. The source is also Kaggle, and it can be found here: https://www.kaggle.com/datasets/ziya07/intelligent-manufacturing-dataset/data
Key Features:
Industrial IoT Sensor Data:
- Temperature_C, Vibration_Hz, Power_Consumption_kW
Network Performance:
- Network_Latency_ms, Packet_Loss_%, Quality_Control_Defect_Rate_%
Production Indicators:
- Production_Speed_units_per_hr, Predictive_Maintenance_Score, Error_Rate_%
Target Column:
- Efficiency_Status
| | Timestamp | Machine_ID | Operation_Mode | Temperature_C | Vibration_Hz | Power_Consumption_kW | Network_Latency_ms | Packet_Loss_% | Quality_Control_Defect_Rate_% | Production_Speed_units_per_hr | Predictive_Maintenance_Score | Error_Rate_% | Efficiency_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-01-01 00:00:00 | 39 | Idle | 74.137590 | 3.500595 | 8.612162 | 10.650542 | 0.207764 | 7.751261 | 477.657391 | 0.344650 | 14.965470 | Low |
| 1 | 2024-01-01 00:01:00 | 29 | Active | 84.264558 | 3.355928 | 2.268559 | 29.111810 | 2.228464 | 4.989172 | 398.174747 | 0.769848 | 7.678270 | Low |
| 2 | 2024-01-01 00:02:00 | 15 | Active | 44.280102 | 2.079766 | 6.144105 | 18.357292 | 1.639416 | 0.456816 | 108.074959 | 0.987086 | 8.198391 | Low |
| 3 | 2024-01-01 00:03:00 | 43 | Active | 40.568502 | 0.298238 | 4.067825 | 29.153629 | 1.161021 | 4.582974 | 329.579410 | 0.983390 | 2.740847 | Medium |
| 4 | 2024-01-01 00:04:00 | 8 | Idle | 75.063817 | 0.345810 | 6.225737 | 34.029191 | 4.796520 | 2.287716 | 159.113525 | 0.573117 | 12.100686 | Low |
Table 2: Example of 5 rows of the synthetic data that will be used for time series analysis.
Power_Consumption_kW summary: mean 5.75, median 5.76, max 10.00, min 1.50, standard deviation 2.45, n = 100,000.
Figure 3: Example of the distribution observed in the second dataset for the parameter Power_Consumption_kW. The data do not follow a specific distribution; they appear to be randomly generated over a fixed range.
As can be observed in Figure 3, the data from the second dataset appear to be randomly generated without following a specific distribution. Real machine data typically follow some type of distribution and are not completely random; commonly, a process has a target value (or values over time) with some natural variation around it. The raw data from this source are therefore not usable as-is for the intent of this study. To make the data more like a real scenario, the mean was calculated over every 12-hour block.
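As a sketch of this transformation (using a small synthetic stand-in frame rather than the actual Kaggle file, with Power_Consumption_kW as the example column), pandas' `resample` can compute the 12-hour block means:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the 1-minute sensor readings (two days of data).
idx = pd.date_range("2024-01-01", periods=2880, freq="min")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Timestamp": idx,
    "Power_Consumption_kW": rng.uniform(1.5, 10.0, len(idx)),
})

# Group the raw readings into 12-hour blocks and take the mean of each block.
blocks = (
    df.set_index("Timestamp")["Power_Consumption_kW"]
      .resample("12h")
      .mean()
)
print(blocks)
```

Averaging over 12-hour blocks smooths the uniform noise toward the center of the range, which is why the standard deviation drops sharply after the transformation.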
Results from the data transformation using the mean for blocks of 12 hours can be observed in figures 4 and 5.
Power_Consumption_kW (12-hour means) summary: mean 5.74, median 5.75, max 8.29, min 2.96, standard deviation 0.67, n = 6,950.
Figure 4: Example of the distribution for the result of grouping the data every 12 hours and calculating the mean for each group.
Figure 5: Example Trend for Power_Consumption_kW for one of the machines.
All detailed data cleaning and exploration for the second dataset can be found here: https://github.com/INFO-523-SU25/final-project-castro/blob/main/src/Data_Exploration_MFG6G.ipynb
Model Training
The first objective of this study is to build a classification model for failures. The model will analyze data from Table 1 to accurately predict the specific failures and failure types.
The initial model focuses on predicting the Target feature as a binary classification, essentially pass or fail based on the dataset. The data were split 70% for training and 30% for testing, using stratification to maintain the 0/1 distribution in each set, since the data are highly imbalanced.
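A minimal sketch of this stratified split, using synthetic stand-in data rather than the actual dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels standing in for the Target column (~3% failures).
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.03).astype(int)

# 70/30 split; stratify=y keeps the 0/1 ratio nearly identical in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())  # failure rates match closely
```

Without `stratify`, a rare class can end up over- or under-represented in one of the splits purely by chance, which distorts every downstream metric.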
AUC ROC Results
| | name | ROC_AUC |
|---|---|---|
| 0 | Nearest_Neighbors | 0.784725 |
| 1 | Gradient_Boosting | 0.821026 |
| 2 | Decision_Tree | 0.932831 |
| 3 | Extra_Trees | 0.910109 |
| 4 | Random_Forest | 0.972481 |
| 5 | Neural_Net | 0.915848 |
| 6 | AdaBoost | 0.899753 |
| 7 | Naive_Bayes | 0.827014 |
| 8 | QDA | 0.857405 |
| 9 | LogisticRegression | 0.880628 |
Table 4. ROC AUC results for multiple models evaluated
F1 Score Results
| | name | f1 score |
|---|---|---|
| 0 | Nearest_Neighbors | 0.423077 |
| 1 | Gradient_Boosting | 0.534884 |
| 2 | Decision_Tree | 0.503145 |
| 3 | Extra_Trees | 0.382353 |
| 4 | Random_Forest | 0.567742 |
| 5 | Neural_Net | 0.184874 |
| 6 | AdaBoost | 0.473373 |
| 7 | Naive_Bayes | 0.198758 |
| 8 | QDA | 0.386740 |
| 9 | LogisticRegression | 0.224000 |
Table 5. F1 Score for multiple models evaluated
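A comparison loop like the one behind Tables 4 and 5 might look as follows; the data here are a synthetic imbalanced stand-in, and only three of the ten models are shown:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in: ~97% class 0, ~3% class 1.
X, y = make_classification(n_samples=2000, weights=[0.97], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Random_Forest": RandomForestClassifier(random_state=0),
    "Gradient_Boosting": GradientBoostingClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]  # for ROC AUC
    pred = model.predict(X_te)               # for F1
    print(name, round(roc_auc_score(y_te, proba), 3), round(f1_score(y_te, pred), 3))
```

Computing both metrics side by side makes the gap discussed below visible: ROC AUC can stay high while F1 on the minority class remains poor.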
This initial exploration of multiple models produced very high ROC AUC scores for most models but much weaker F1 scores. Two observations follow: first, the models might be over-fitting the data, inflating the scores; second, because the data are highly imbalanced, the models are good at predicting 0s (good parts), which form the majority of the sample, but the F1 score shows that precision and recall for predicting 1s are poor. For the purposes of this study, predicting the 1s (failures) is the main problem, as these are the defects.
Based on these initial results, two models were further evaluated: Random Forest Classifier and XGBoost. This involved fine-tuning hyper-parameters and applying multiple sampling methods to reduce or manage the imbalance of the data.
For hyper-parameter tuning, a combination of GridSearchCV and RandomizedSearchCV from sklearn.model_selection was used to search over multiple options.
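A sketch of such a search with GridSearchCV, using a small illustrative grid rather than the full set of options explored in the notebook, and synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced stand-in dataset.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# Small illustrative grid; the real search covered more parameters and values.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",  # optimize for the minority class, not plain accuracy
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Using `scoring="f1"` matters here: with 95% negatives, accuracy would reward a model that never predicts a failure.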
Final hyper-parameters for the Random Forest Classifier model:

model = RandomForestClassifier(n_estimators=50,
                               max_depth=10,
                               random_state=42,
                               max_features='log2',
                               min_samples_leaf=5,
                               min_samples_split=5)

Results for the Random Forest Classifier model:
RandomForestClassifier: ROC AUC on test dataset: 0.9751
RandomForestClassifier: F1 score on test dataset: 0.6258
Cross-validation was used to assess whether the model is over-fitting. Cross-validation results for Random Forest:

| Metric | Train | Test |
|---|---|---|
| accuracy | 0.99 | 0.90 |
| precision | 0.97 | 0.69 |
| recall | 0.72 | 0.46 |
| f1 | 0.83 | 0.45 |
| roc_auc | 1.00 | 0.90 |

Fit time: 0.21 s; score time: 0.01 s.
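Train/test metrics like the ones above can be produced with sklearn's cross_validate; the dataset below is a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic imbalanced stand-in dataset.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42)

# return_train_score=True exposes the train/test gap used to judge over-fitting.
cv = cross_validate(
    model, X, y, cv=5,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
    return_train_score=True,
)
for key in ["train_f1", "test_f1", "train_roc_auc", "test_roc_auc"]:
    print(key, round(cv[key].mean(), 2))
```

A large gap between the train and test columns (as seen in the recall and F1 rows above) is the usual signature of over-fitting.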
After tuning the model, the best F1 score obtained was 0.61. Cross-validation results suggest the model is over-fitting and might not generalize well.
Different techniques were explored to see whether they would reduce over-fitting and also improve the F1 score. The Synthetic Minority Oversampling Technique (SMOTE) from the imblearn library was used; the intent of this method is to over-sample the minority class by creating synthetic data. The results were worse than the original model. Additionally, the under-sampling method RandomUnderSampler from the same imblearn library was used; in this case, it reduces the sample of the majority class to help balance the data. These results were not better than the original tuned model either.
Additionally, to improve the F1 score, a change in the probability threshold was explored; instead of the default 0.5, an analysis was done to estimate the threshold that optimizes the F1 score.
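One way to estimate such a threshold is to scan the precision-recall curve and pick the point that maximizes F1; this sketch uses synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in dataset.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = (
    RandomForestClassifier(random_state=42)
    .fit(X_tr, y_tr)
    .predict_proba(X_te)[:, 1]
)

# Scan every candidate threshold and pick the one maximizing F1.
precision, recall, thresholds = precision_recall_curve(y_te, proba)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = thresholds[np.argmax(f1[:-1])]  # last precision/recall pair has no threshold
print("best threshold:", round(float(best), 2))
```

Lowering the threshold below 0.5 trades precision for recall, which is how the optimized threshold of 0.28 shifts the balance in the confusion matrices shown in Figures 7 and 8.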
Figure 6. Values of Recall, Precision and F1 Score metrics for every threshold.
Figure 7. Confusion Matrix for results of Random Forest Classifier for default threshold 0.5.
Figure 8. Confusion Matrix for results of Random Forest Classifier for optimized threshold 0.28.
As observed in Figures 7 and 8, after optimizing the threshold based on the model results, a balance can be found between precision and recall. The threshold can then be tuned further to favor one or the other; in some cases the model might need to be tuned in a specific direction to reduce over-rejection or under-rejection.
A second approach is to use XGBoost; this model has the option to handle weights for each class. By adding a higher weight to the minority class, it is expected to handle the imbalance in the dataset better.
XGBoost Results:
F1 Score: 0.757
Cross-validation results for XGBoost:

| Metric | Train | Test |
|---|---|---|
| accuracy | 1.00 | 0.98 |
| precision | 1.00 | 0.74 |
| recall | 1.00 | 0.72 |
| f1 | 1.00 | 0.73 |
| roc_auc | 1.00 | 0.97 |

Fit time: 0.07 s; score time: 0.01 s.
The F1 score for XGBoost without significant tuning and using weights is slightly better than the random forest model originally used. Cross-validation results are still showing some level of over-fitting but have improved from the original model. Hyperparameter tuning was done with a similar approach as with random forest, but no significant improvement was observed.
Figure 9. Confusion Matrix for results of XGBoost Model.
XGBoost results are similar to those of the random forest model with the optimized threshold. Based on the cross-validation results, over-fitting appears less severe in the XGBoost model.
Predicting the Failure Type
Since the XGBoost results were slightly better, this model will be used to go beyond the binary classification and try to predict the different failure modes.
Figure 10. Confusion Matrix results for the XGBoost multi-class model. Class names: ‘Heat Dissipation Failure’: 0, ‘No Failure’: 1, ‘Overstrain Failure’: 2, ‘Power Failure’: 3, ‘Random Failures’: 4, ‘Tool Wear Failure’: 5.
For this multi-class classification case, the model was trained using the same parameters as before. The main difference is how the weights (the sample_weight parameter) were estimated; since there is more than one class, an array of weights was calculated. Based on research, a common way to calculate these weights is to count the occurrences of each class and assign weights inversely proportional to that frequency.
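scikit-learn's compute_sample_weight with the 'balanced' option implements exactly this inverse-frequency rule; a small sketch with hypothetical class counts:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical multi-class labels: class 1 ("No Failure") dominates.
y = np.array([1] * 95 + [0] * 2 + [2] * 2 + [3] * 1)

# 'balanced' gives each sample a weight inversely proportional to its
# class frequency: n_samples / (n_classes * count(class)).
w = compute_sample_weight("balanced", y)
print({c: round(float(w[y == c][0]), 2) for c in np.unique(y)})
```

The resulting array can be passed directly as the sample_weight argument of an XGBoost or sklearn fit call, so rare failure modes count as much as the dominant "No Failure" class during training.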
As with binary classification, the model is good at predicting the majority class (No Failure). Class 0 (Heat Dissipation Failure) has an F1 score of 0.68, class 2 (Overstrain Failure) 0.53, and class 3 (Power Failure) 0.77, similar to the original model results. However, in this case the model is not able to predict classes 4 and 5 (Random Failures, Tool Wear Failure); the F1 score is 0 for these.
One last important outcome of these models is understanding which features really matter. Although the model performance is not great for all classes, knowing which features drive the predictions can help subject matter experts interpret the results and take action to reduce failures and improve the process overall.
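For tree ensembles, these importances can be read from the fitted model's feature_importances_ attribute; the feature names below are illustrative stand-ins for the dataset's columns, and the data are synthetic:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical names mirroring the predictive-maintenance columns.
names = ["air_temp_K", "process_temp_K", "rpm", "torque_Nm", "tool_wear_min"]
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

model = RandomForestClassifier(random_state=42).fit(X, y)
# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(model.feature_importances_, index=names).sort_values(ascending=False)
print(importances)
```

A ranked list like this is what subject matter experts would review to decide, for example, whether torque or tool wear deserves closer process monitoring.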
Detailed Jupyter Notebook can be found here: https://github.com/INFO-523-SU25/final-project-castro/blob/main/src/Model_Training_PDM.ipynb
Time Series Analysis
The second objective of this study, which is also an important tool in manufacturing, is how to use machine learning and data mining skills to detect outliers or anomalies in the process. A lot of the data that streams from equipment is time-based; it is collected or sampled at a certain frequency. Anomaly detection is very valuable for the industry, as having the ability to know when a machine or process is deviating from the ‘normal’ can help to stop and repair equipment quickly, reducing the impact problems might have due to downtimes and defects.
After pre-processing the data, the study focuses on how tools such as seasonal_decompose (from statsmodels) and ARIMA (via auto_arima) can be used for anomaly detection. As initially observed, the data were randomly generated over a fixed range; they do not follow a trend, and there is no seasonality in them.
Figure 11. Results from seasonal decomposition: observed (original data), trend (smoothed), seasonal (pattern found for the defined period), and residual (delta between the original data and the trend/seasonal components).
Assuming the decomposition is somehow accurate, we can use the residuals to define how each point is deviating from what was expected. Based on this assumption, a rule can be defined to identify what constitutes abnormal behavior.
# Anomalies in residuals
residuals = decomposition.resid.dropna()  # obtain residuals from the decomposition
threshold = 2 * residuals.std()  # our rule: based on research, 2x the standard deviation of the residuals
anomalies = np.abs(residuals) > threshold  # apply the rule to flag anomalies

The second approach is to use ARIMA (Autoregressive Integrated Moving Average). An ARIMA model is fitted using auto_arima from pmdarima to simplify the selection of the (p, d, q) values. As with the other method, after fitting the model we calculate the difference between the actual and predicted values and apply a rule to this difference, as done with the seasonal decomposition.
SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 139
Model: SARIMAX Log Likelihood -149.194
Date: Thu, 14 Aug 2025 AIC 302.388
Time: 15:18:34 BIC 308.257
Sample: 01-01-2024 HQIC 304.773
- 03-10-2024
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
intercept 5.8176 0.060 96.802 0.000 5.700 5.935
sigma2 0.5010 0.059 8.537 0.000 0.386 0.616
===================================================================================
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 0.16
Prob(Q): 0.98 Prob(JB): 0.92
Heteroskedasticity (H): 1.37 Skew: 0.07
Prob(H) (two-sided): 0.28 Kurtosis: 3.10
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Best (p, d, q): (0, 0, 0)
Table 6. SARIMAX results summary.
Figure 12. Results from ARIMA model and anomaly points detected based on pre-defined threshold of 2X the standard deviation of the residuals (True = anomaly).
The predictions are almost a constant value around the center of the distribution, meaning the ARIMA model essentially predicts the mean for every value of the time series. One reason for this could be that the data are simply not predictable; as originally stated, they came from a random generator and were then summarized. Why is this still a valid dataset? A real machine, as mentioned before, might be designed to run at a specific target (say, consuming 6 kW); however, normal variation in processes and systems adds noise around it, and the accuracy and precision of the tool in maintaining the target define how much it varies. This variation is usually not predictable; in such cases, models like ARIMA or seasonal decomposition might not be ideal. LSTM was also explored, with similar results.
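Because auto_arima selected order (0, 0, 0), the fitted model effectively forecasts the series mean, so the 2-sigma residual rule reduces to flagging points far from the center line. A NumPy-only sketch on a toy series with two injected deviations (the 6 kW target and the threshold rule follow the discussion above; the specific values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy series: noise around a 6 kW target with two injected deviations.
y = rng.normal(6.0, 0.7, 200)
y[[50, 150]] = [9.5, 2.0]

# With an ARIMA(0,0,0)-style model the forecast is just the mean,
# so residuals are simply deviations from the center line.
residuals = y - y.mean()
threshold = 2 * residuals.std()
anomalies = np.abs(residuals) > threshold
print("anomalous indices:", np.flatnonzero(anomalies))
```

This also illustrates the closing point of this section: for data with no learnable pattern, a simple center-line rule can recover most of what the ARIMA pipeline provides.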
Although this dataset might not be ideal for these models, as observed in Figure 12, the model can still identify the points that deviate most from the center line, which is what this analysis intended to explore. That said, simpler methods might accomplish the same in this specific case.
Details of analysis can be found here: https://github.com/INFO-523-SU25/final-project-castro/blob/main/src/Time_Series_Analysis.ipynb
Conclusions
- The study demonstrated the application of concepts in machine learning to a common manufacturing problem.
- The Random Forest Classifier model's ROC AUC scores are high, indicating the model can differentiate effectively between failures and non-failures. However, in a real manufacturing process the majority of results are positive/pass (no failures), making this indicator less meaningful for this case. Recall and precision are more appropriate; depending on the use case, the model can be tuned toward one or the other, or the F1 score can be used to optimize both.
- The Random Forest model performed poorly for the F1 score when using a standard threshold of 0.5. The study demonstrated this can be improved by selecting an optimized threshold based on the model results.
- Sampling can be a useful method to handle imbalanced datasets; however, in this specific case, it did not provide a significant improvement in model performance.
- Assigning weights to each class to handle the imbalanced sample, in combination with a gradient boosting model (XGBoost), produced better F1 scores.
- Multi-class classification results using the learnings from the binary-classification were demonstrated. Due to the nature and frequency of the failures and their relationship with the input features, two classes had zero F1-scores, meaning the model was not able to predict these. Other classes had an F1-score between 0.5 and 0.7, similar to the binary classification results.
- The study demonstrated how seasonal-decomposition and ARIMA can be used for anomaly detection in a real manufacturing use case; results showed how data points deviating the most from the center/target can be identified by using these methods. These methods might not be the best for a process where there are no patterns and data might just have random variability from the target.
References
Shivam Bansal. “Machine Predictive Maintenance Classification Dataset.” Kaggle. Available at: https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification.
Ziya. “Intelligent Manufacturing Dataset.” Kaggle. Available at: https://www.kaggle.com/datasets/ziya07/intelligent-manufacturing-dataset/data.
INFO-524 University of Arizona. “Comparing Classifiers.” GitHub Notebook. Available at: https://github.com/dataprofessor/code/blob/master/python/comparing-classifiers.ipynb.
G. Lemaître, F. Nogueira, and C. K. Aridas. “SMOTE: Synthetic Minority Over-sampling Technique.” In: imbalanced-learn documentation. Available at: https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html.
G. Lemaître, F. Nogueira, and C. K. Aridas. “RandomUnderSampler.” In: imbalanced-learn documentation. Available at: https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html.
T. Chen and C. Guestrin. “XGBoost for Imbalanced Classification.” XGBoosting.com. Available at: https://xgboosting.com/xgboost-for-imbalanced-classification.
S. Puranik. “Calinski–Harabasz Index for K-Means Clustering Evaluation.” Towards Data Science, August 2021. Available at: https://towardsdatascience.com/calinski-harabasz-index-for-k-means-clustering-evaluation-using-python-4fefeeb2988e/.
Skipper Seabold and Josef Perktold. “seasonal_decompose — Seasonal Decomposition of Time Series.” statsmodels, accessed 2025. Available at: https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html.
Jason Brownlee. “Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras.” Machine Learning Mastery, March 10, 2018. Available at: https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/.
Explanations, troubleshooting, and clarifications were aided by ChatGPT (OpenAI, 2025). OpenAI. [Large language model]. https://chat.openai.com/